Abstract

In many learning tasks, structural models usually lead to better interpretability and higher generalization performance. In recent years, however, simple structural models such as the lasso have frequently proved to be insufficient. Accordingly, there has been a lot of work on “superposition-structured” models where multiple structural constraints are imposed. To efficiently solve these “superposition-structured” statistical models, we develop a framework based on a proximal Newton-type method. Employing the smoothed conic dual approach with the LBFGS updating formula, we propose a scalable and extensible proximal quasi-Newton (SEP-QN) framework. Empirical analysis on various datasets shows that our framework is potentially powerful, and achieves a super-linear convergence rate for optimizing some popular “superposition-structured” statistical models such as the fused sparse group lasso.

1 Introduction

In this paper, we consider the “superposition-structured” statistical models (Yang and Ravikumar 2013) where multiple structural constraints are imposed. Examples of such structural constraints include sparsity constraints, graph structure, group structure, etc. We can leverage such structural constraints via specific regularization functions. Consequently, many problems of relevance in “superposition-structured” statistical learning can be formulated as minimizing a composite function:

$$\min_{\mathbf{x} \in \mathbb{R}^p} f(\mathbf{x}) := g(\mathbf{x}) + \Psi(\mathbf{x}), \tag{1}$$

where $g$ is a convex and continuously differentiable loss function, and $\Psi$ is a hybrid regularization, usually defined as a sum of $N$ convex (non-smooth) functions. More specifically,

$$\Psi(\mathbf{x}) = \sum_{i=1}^{N} \psi_i(\mathbf{W}_i \mathbf{x} + \mathbf{b}_i),$$

where each $\psi_i$ is convex but not necessarily differentiable, and $\mathbf{W}_i \in \mathbb{R}^{q_i \times p}$ and $\mathbf{b}_i \in \mathbb{R}^{q_i}$ are given. For example, $\Psi(\mathbf{x}) = \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{F}\mathbf{x}\|_1 + \lambda_3 \sum_{j} \|\mathbf{G}_j \mathbf{x}\|_2$ defines a fused sparse group penalty (Zhou et al. 2012) when $\mathbf{F}$ is the difference matrix and $\mathbf{G}_j$ indicates the group.

Indeed, there are plenty of machine learning models which can be cast into the formulation in (1).

• Generalized lasso model: all generalized lasso models such as the fused lasso (Tibshirani et al. 2005), the sparse group lasso (Simon et al. 2013), and the group lasso for logistic regression (Meier, Van De Geer, and Bühlmann 2008) can be written in the following form:

$$\min_{\mathbf{x}} f(\mathbf{x}) = g(\mathbf{x}) + \lambda_1 \|\mathbf{F}\mathbf{x}\|_1 + \sum_{j=2}^{N} \lambda_j \|\mathbf{G}_j \mathbf{x}\|_2.$$




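• Multi-task learning: given $r$ tasks, each with sample matrix $\mathbf{X}^{(k)} \in \mathbb{R}^{n_k \times p}$ ($n_k$ samples in the $k$-th task) and labels $\mathbf{y}^{(k)}$, Jalali et al. (2010) proposed minimizing the following objective:

$$\min f(\mathbf{x}) = \sum_{k=1}^{r} \ell\big(\mathbf{y}^{(k)}, \mathbf{X}^{(k)}(\mathbf{S}^{(k)} + \mathbf{B}^{(k)})\big) + \lambda_1 \|\mathbf{S}\|_1 + \lambda_2 \|\mathbf{B}\|_{1,\infty}, \tag{2}$$

where $\ell(\cdot)$ is the loss function and $\mathbf{S}^{(k)}$ is the $k$-th column of $\mathbf{S}$. Besides, other multi-task learning models such as the one in (Kim and Xing 2010) can also be cast into (1).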


• Gaussian graphical model with latent variables: Chandrasekaran et al. showed that the precision matrix has a low rank + sparse structure when some random variables are hidden; thus the “superposition-structured” model is particularly helpful.
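As a concrete illustration of the hybrid regularizer $\Psi$, the following minimal sketch evaluates the fused sparse group penalty $\lambda_1\|\mathbf{x}\|_1 + \lambda_2\|\mathbf{F}\mathbf{x}\|_1 + \lambda_3\sum_j\|\mathbf{G}_j\mathbf{x}\|_2$ with NumPy. The function names, the representation of groups as index lists, and the choice of a first-order difference matrix for $\mathbf{F}$ are assumptions made for this example only.

```python
import numpy as np

def difference_matrix(p):
    """(p-1) x p first-order difference matrix F, so (F x)[i] = x[i+1] - x[i]."""
    return np.diff(np.eye(p), axis=0)

def fused_sparse_group_penalty(x, groups, lam1, lam2, lam3):
    """Evaluate lam1*||x||_1 + lam2*||F x||_1 + lam3*sum_j ||x[group_j]||_2.

    `groups` is a list of index lists; selecting x[idx] plays the role of G_j x
    (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    F = difference_matrix(x.size)
    sparse_term = lam1 * np.abs(x).sum()
    fused_term = lam2 * np.abs(F @ x).sum()
    group_term = lam3 * sum(np.linalg.norm(x[idx]) for idx in groups)
    return sparse_term + fused_term + group_term

x = np.array([3.0, 4.0, 0.0, 0.0])
groups = [[0, 1], [2, 3]]
penalty = fused_sparse_group_penalty(x, groups, 1.0, 1.0, 1.0)
```

Each of the three terms corresponds to one $\psi_i(\mathbf{W}_i\mathbf{x} + \mathbf{b}_i)$ with $\mathbf{b}_i = \mathbf{0}$, which is what makes the penalty fit the formulation in (1).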

Moreover, many real-world problems benefit from these models, such as gene expression analysis, time-varying networks, and disease progression. In this paper we mainly study the computational issue of the model in (1).

There are some generic methods that can, in theory, be used to solve these models. CVX (Grant and Boyd 2014) is able to solve these models, but it is not scalable. The primal-dual approach proposed by Combettes and Pesquet (2012) can deal with these models, but it converges slowly. The smoothed conic dual (SCD) approach, studied in (Lan, Lu, and Monteiro 2011; Nesterov 2005) and by Becker, Candès, and Grant (2012), can attain $O(1/\epsilon)$ iteration complexity, but it needs to find the minimizer related to $g(\mathbf{x})$ in each iteration. In addition, the alternating direction method of multipliers (ADMM) (Boyd et al. 2011) can also be used to solve this kind of problem. However, ADMM still suffers from the same bottleneck as the methods mentioned earlier. Additionally, since disk I/O is often the bottleneck of computation on large datasets, it is important to reduce the number of evaluations of $g(\mathbf{x})$. In summary, it is challenging to efficiently solve the model on large-scale datasets.

Recently, there has been a flurry of activity in developing Newton-type methods for minimizing composite functions of the form (1). In particular, Lee, Sun, and Saunders (2014) and Becker and Fadili (2012) focused on minimizing a composite function that contains a convex smooth function and a convex non-smooth function with a simple proximal mapping. They also analyzed the convergence rate of various proximal Newton-type methods. Schmidt, Kim, and Sra (2011) discussed a projected quasi-Newton algorithm, but its sub-iteration procedure is too costly. Hsieh et al. (2014) further generalized the Newton method to handle some dirty statistical estimators. Their developments “open up the state of the art but forbidding class of M-estimators to very large-scale problems.” In addition, there are already plenty of packages that implement these Newton-type methods, such as LIBLINEAR (Fan et al. 2008) and GLMNET (Friedman, Hastie, and Tibshirani 2009; Yuan, Ho, and Lin 2012), but they are limited to solving simple models such as lasso and elastic net.

To solve the “superposition-structured” models in (1) on large-scale problems, we resort to a proximal quasi-Newton method which converges superlinearly (Lee, Sun, and Saunders 2014). We develop a Scalable and Extensible Proximal Quasi-Newton (SEP-QN) framework to solve these models. More specifically, we apply a smoothed conic dual (SCD) approach to solving a surrogate of the original model (1). We employ the LBFGS updating formula, so that the surrogate problem can be solved both efficiently and robustly. Moreover, we present several acceleration techniques, including an adaptive initial Hessian, warm start, and continuation SCD, to solve the surrogate problem more efficiently and obtain a faster convergence rate.

In the following we start by presenting our SEP-QN 
framework for solving the “superposition-structured” statis¬ 
tical models. Then we present the approach to solve the sur¬ 
rogate problem, followed by theoretical analysis and con¬ 
cluding empirical analysis. 


2 The SEP-QN Framework 

In this section we present the SEP-QN framework for solving the “superposition-structured” statistical model in (1). We refer to $g(\mathbf{x})$ as the “smooth part” and $\Psi(\mathbf{x})$ as the “non-smooth part.” Usually, $g(\mathbf{x})$ is a loss function. For example, $g(\mathbf{x}) = \frac{1}{2}\sum_{i=1}^{n} (y_i - \mathbf{a}_i^T \mathbf{x})^2$ in the least squares regression problem, where the $\mathbf{a}_i \in \mathbb{R}^p$ are input vectors and the $y_i \in \mathbb{R}$ are the corresponding outputs, and $g(\mathbf{x}) = \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{a}_i^T \mathbf{x}))$ in logistic regression, where $y_i \in \{-1, 1\}$. We are especially interested in the large-scale case; i.e., the number of training data $n$ is large.
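The smooth part and its gradient are the only quantities that touch the full dataset, so it helps to see them written out. The sketch below computes the logistic loss $g(\mathbf{x}) = \sum_i \log(1 + \exp(-y_i \mathbf{a}_i^T\mathbf{x}))$ and its gradient with NumPy; the function names and the row-wise layout of the inputs in a matrix `A` are assumptions of this example.

```python
import numpy as np

def logistic_loss(x, A, y):
    """g(x) = sum_i log(1 + exp(-y_i * a_i^T x)); rows of A are the inputs a_i."""
    margins = y * (A @ x)
    # logaddexp(0, -m) = log(1 + exp(-m)), evaluated without overflow.
    return np.logaddexp(0.0, -margins).sum()

def logistic_grad(x, A, y):
    """Gradient of logistic_loss with respect to x."""
    margins = y * (A @ x)
    # 1 / (1 + exp(m)) written as (1 - tanh(m / 2)) / 2 for numerical stability.
    weights = -y * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return A.T @ weights
```

One evaluation of either function costs one pass over the data, which is why the number of such evaluations is the quantity the framework tries to keep small.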


Basic Framework 

Roughly speaking, the method is built on a line search strategy, which produces a sequence of points $\{\mathbf{x}_k\}$ according to

$$\mathbf{x}_{k+1} = \mathbf{x}_k + t_k \Delta\mathbf{x}_k,$$

where $t_k$ is a step length calculated by backtracking, and $\Delta\mathbf{x}_k$ is a descent direction. We compute the descent direction by minimizing a surrogate of the objective function $f$. Given the $k$-th estimate $\mathbf{x}_k$ of $\mathbf{x}$, we let $f_k(\mathbf{x})$ be a local approximation of $f$ around $\mathbf{x}_k$. The descent direction $\Delta\mathbf{x}_k$ is obtained by solving the following surrogate problem:

$$\min_{\mathbf{x}} f_k(\mathbf{x}). \tag{3}$$

Proximal Newton-type methods approximate only the smooth part $g$ with a local quadratic form. Thus, in this paper the surrogate function is defined by

$$\begin{aligned} f_k(\mathbf{x}) &= g_k(\mathbf{x}) + \Psi(\mathbf{x}) \\ &= g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T (\mathbf{x} - \mathbf{x}_k) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_k)^T \mathbf{H}_k (\mathbf{x} - \mathbf{x}_k) + \Psi(\mathbf{x}), \end{aligned} \tag{4}$$

where $\mathbf{H}_k$ is a $p \times p$ positive definite matrix that approximates the Hessian of $g$ at $\mathbf{x} = \mathbf{x}_k$. There are many strategies for choosing $\mathbf{H}_k$, such as BFGS and LBFGS (Nocedal and Wright 2006). Considering the use in large-scale problems, we will employ LBFGS to compute $\mathbf{H}_k$.
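With LBFGS the matrix $\mathbf{H}_k$ is never formed explicitly; products with its inverse are computed from the stored pairs $(\mathbf{s}_i, \mathbf{y}_i)$. The sketch below is the standard two-loop recursion (Nocedal and Wright 2006), assuming the initial Hessian is a scaled identity `h0 * I`; the function name and argument layout are hypothetical and not taken from the authors' implementation.

```python
import numpy as np

def lbfgs_apply_inverse(v, S, Y, h0):
    """Return H_k^{-1} v via the two-loop recursion.

    S and Y are lists of curvature pairs (oldest first) with y_i^T s_i > 0,
    and the initial Hessian is assumed to be h0 * I with h0 > 0.
    """
    q = np.array(v, dtype=float)
    rhos = [1.0 / float(y @ s) for s, y in zip(S, Y)]
    alphas = []
    # First loop: newest pair to oldest pair.
    for s, y, rho in zip(reversed(S), reversed(Y), reversed(rhos)):
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    r = q / h0  # apply the inverse of the initial Hessian
    # Second loop: oldest pair to newest pair.
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

The cost is linear in the history size times $p$, which is the source of the $Mp$ term in a per-iteration cost estimate.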

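After we have obtained the minimizer $\hat{\mathbf{x}}_k$ of (3), we use a line search procedure such as backtracking to select the step length $t_k$ such that a sufficient descent condition is satisfied (Lee, Sun, and Saunders 2012). That is,

$$f(\mathbf{x}_k + t_k \Delta\mathbf{x}_k) \le f(\mathbf{x}_k) + \alpha t_k \gamma_k, \tag{5}$$

where $\alpha \in (0, 1/2)$, $\Delta\mathbf{x}_k \triangleq \hat{\mathbf{x}}_k - \mathbf{x}_k$, and

$$\gamma_k \triangleq \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) - \Psi(\mathbf{x}_k).$$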
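The sufficient descent test (5) can be sketched as a simple backtracking loop. The shrink factor of 0.5, the iteration cap, and the function signature below are assumptions of this example; only the acceptance condition itself comes from (5).

```python
def backtracking(f, x, dx, gamma, alpha=1e-4, shrink=0.5, max_iter=50):
    """Return a step length t with f(x + t*dx) <= f(x) + alpha * t * gamma.

    gamma is the predicted decrease gamma_k from the text (a negative number
    for a descent direction). Starts from the unit step and halves it.
    """
    t = 1.0
    fx = f(x)
    for _ in range(max_iter):
        if f(x + t * dx) <= fx + alpha * t * gamma:
            return t
        t *= shrink
    return t

# Scalar example: g(x) = 10 x^2 with H_k = 1 badly underestimating the
# curvature (20), so dx = -20 overshoots and the step must be shortened.
step = backtracking(lambda x: 10.0 * x * x, 1.0, -20.0, -400.0)
```

When $\mathbf{H}_k$ approximates the Hessian well, the unit step is accepted immediately and the loop costs a single evaluation of $f$.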

Algorithm 1 gives the basic framework of SEP-QN. The key is to solve the surrogate problem (3) when there are multiple structural constraints. In Algorithm 2 we present the method of solving the problem (3). Moreover, we develop several techniques to further accelerate our method. Specifically, we propose an acceleration scheme that adaptively adjusts the initial Hessian $\mathbf{H}_0$ in Algorithm 3. We will see that with an appropriate $\mathbf{H}_0$, $\mathbf{H}_k$ can be a better approximation of $\nabla^2 g(\mathbf{x}_k)$, leading to much faster convergence.


The Solution of the Surrogate Problem (3)

If there is only one non-smooth function in $\Psi(\mathbf{x})$ (i.e., $N = 1$) with a simple proximal mapping, we can solve the surrogate problem (3) directly and efficiently via various optimal first-order algorithms such as FISTA (Beck and Teboulle 2009) and coordinate descent, which is used in LIBLINEAR (Fan et al. 2008) and GLMNET (Friedman, Hastie, and Tibshirani 2009; Yuan, Ho, and Lin 2012). In this paper we mainly consider the case in which there are multiple non-smooth functions. In this case, we could use SCD or ADMM to solve the problem. Since we empirically observe that SCD outperforms ADMM, we resort to the SCD approach.
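To make the "simple proximal mapping" case concrete, consider $N = 1$ with $\Psi(\mathbf{x}) = \lambda\|\mathbf{x}\|_1$ and, as a simplifying assumption of this sketch, $\mathbf{H}_k = c\mathbf{I}$. The surrogate (3) then has a closed-form minimizer given by soft-thresholding. The function names below are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1 applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def surrogate_min_l1(x_k, grad, c, lam):
    """Minimizer of grad^T (x - x_k) + (c/2) ||x - x_k||^2 + lam * ||x||_1.

    Assumes H_k = c * I with c > 0; for a general H_k no closed form exists
    and an iterative solver is needed.
    """
    return soft_threshold(x_k - grad / c, lam / c)

x_hat = surrogate_min_l1(np.array([1.0, -1.0, 0.2]), np.array([0.5, 0.5, 0.0]), 2.0, 1.0)
```

With several non-smooth terms the proximal mapping of their sum is generally not available in closed form, which is the situation the SCD approach addresses.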









































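Algorithm 1 The SEP-QN Framework

Require: $\mathbf{x}_0$ and $\mathbf{H}_0$
Ensure: $\mathbf{x}_0 \in \operatorname{dom} f$, and $\mathbf{H}_0$ is a scaled identity matrix (positive definite).
1: $S \leftarrow [\,]$, $Y \leftarrow [\,]$, and $\beta \leftarrow 2$
2: repeat
3:   Update $\mathbf{H}_k$ using LBFGS, where $\mathbf{H}_k$ is symmetric positive definite.
4:   Solve the problem in (3) for a descent direction: $\Delta\mathbf{x}_k \leftarrow \operatorname{argmin}_{\Delta} f_k(\mathbf{x}_k + \Delta)$ (Algorithm 2)
5:   Search $t_k$ with the backtracking method.
6:   Update $\mathbf{x}_{k+1} \leftarrow \mathbf{x}_k + t_k \Delta\mathbf{x}_k$
7:   if $(\mathbf{x}_{k+1} - \mathbf{x}_k)^T (\nabla g(\mathbf{x}_{k+1}) - \nabla g(\mathbf{x}_k)) > 0$ then
8:     $S \leftarrow [S, \mathbf{x}_{k+1} - \mathbf{x}_k]$
9:     $Y \leftarrow [Y, \nabla g(\mathbf{x}_{k+1}) - \nabla g(\mathbf{x}_k)]$
10:    $\mathbf{H}_0 \leftarrow \text{Ada\_Hess}(t_k, \beta, \mathbf{x}_{k+1} - \mathbf{x}_k, \nabla g(\mathbf{x}_{k+1}) - \nabla g(\mathbf{x}_k), \mathbf{H}_0)$ (Algorithm 3)
11:  end if
12: until stopping condition is satisfied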


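The SCD Approach  In order to solve the problem (3) efficiently when $N > 1$, we employ the SCD approach. The main idea is to solve the surrogate problem via its dual.

We first reformulate our concerned problem in (3) into the following form:

$$\begin{aligned} \min_{\mathbf{w}} \;& g_k(\mathbf{x}) + \sum_{i=1}^{N} t_i \\ \text{s.t.} \;& (\mathbf{W}_i \mathbf{x} + \mathbf{b}_i, t_i) \in \mathcal{K}_i, \quad i = 1, \ldots, N, \end{aligned} \tag{6}$$

where $\mathbf{w} = (\mathbf{x}, t_1, \ldots, t_N)$, the $t_i$ are new scalar variables, and $\mathcal{K}_i$ is a closed convex cone (usually the epigraph $\psi_i(\mathbf{W}_i \mathbf{x} + \mathbf{b}_i) \le t_i$). Since projection onto the set $\{\mathbf{x} \mid (\mathbf{W}_i \mathbf{x} + \mathbf{b}_i, t_i) \in \mathcal{K}_i\}$ might be expensive, we address this issue by solving the dual problem.

We denote the dual variables by $\Lambda = (\mathbf{z}_1, \tau_1, \ldots, \mathbf{z}_N, \tau_N)$ and $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_N)$, where $(\mathbf{z}_i, \tau_i) \in \mathcal{K}_i^*$, and $\mathcal{K}_i^*$ is the dual cone defined by

$$\mathcal{K}_i^* = \{\mathbf{x} : \mathbf{x}^T \mathbf{y} \ge 0 \text{ for all } \mathbf{y} \in \mathcal{K}_i\}.$$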


The Lagrangian in (7) is unbounded unless $\tau_i = 1$. Because the approximate Hessian matrix $\mathbf{H}_k$ is positive definite, this problem is strongly convex, which guarantees the convergence rate.

Denote $D^{-}(\mathbf{z}) = -D(\Lambda) = -D(\mathbf{z}_1, 1, \ldots, \mathbf{z}_N, 1)$, and suppose $\mathbf{x}(\mathbf{z})$ is the unique Lagrangian minimizer. Nesterov (2005) proved that $D^{-}(\mathbf{z})$ is convex and continuously differentiable, and that $\nabla D^{-}(\mathbf{z}) = (\mathbf{W}_1 \mathbf{x}(\mathbf{z}) + \mathbf{b}_1, \ldots, \mathbf{W}_N \mathbf{x}(\mathbf{z}) + \mathbf{b}_N)^T$ is Lipschitz continuous. Thus, provably convergent and accelerated gradient methods in the Nesterov style are possible.

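In particular, we need to minimize $D^{-}(\mathbf{z})$. A standard gradient projection step for the smoothed dual problem is

$$\mathbf{z}^{(j+1)} = \operatorname*{argmin}_{\mathbf{z} : (\mathbf{z}_i, 1) \in \mathcal{K}_i^*} \big\| \mathbf{z} - \mathbf{z}^{(j)} + t^{(j)} \nabla D^{-}(\mathbf{z}^{(j)}) \big\|_2. \tag{8}$$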


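Then we need to obtain $\mathbf{x}(\mathbf{z}^{(j)})$ and $\nabla D^{-}(\mathbf{z}^{(j)})$. By substituting $g_k(\mathbf{x})$ into (7), collecting the linear and quadratic terms, and eliminating the unrelated terms, we get the reduced Lagrangian

$$D(\mathbf{z}) = \inf_{\mathbf{x}} \Big\{ \frac{1}{2} \mathbf{x}^T \mathbf{H}_k \mathbf{x} + \mathbf{x}^T \Big( \nabla g(\mathbf{x}_k) - \mathbf{H}_k \mathbf{x}_k - \sum_{i=1}^{N} \mathbf{W}_i^T \mathbf{z}_i \Big) \Big\}.$$

The minimizer $\mathbf{x}(\mathbf{z}^{(j)})$ is given by

$$\mathbf{x}(\mathbf{z}^{(j)}) = -\mathbf{H}_k^{-1} \Big( \nabla g(\mathbf{x}_k) - \mathbf{H}_k \mathbf{x}_k - \sum_{i=1}^{N} \mathbf{W}_i^T \mathbf{z}_i^{(j)} \Big). \tag{9}$$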
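The closed form (9) is the first-order optimality condition of the quadratic inside the infimum. The short derivation below spells this out; the symbols $\phi$ and $\mathbf{c}$ are shorthand introduced here for the example, and the only assumption is that $\mathbf{H}_k$ is positive definite, as stated in the text.

```latex
\begin{aligned}
\phi(\mathbf{x}) &= \tfrac{1}{2}\,\mathbf{x}^T \mathbf{H}_k \mathbf{x} + \mathbf{x}^T \mathbf{c},
\qquad
\mathbf{c} = \nabla g(\mathbf{x}_k) - \mathbf{H}_k \mathbf{x}_k - \sum_{i=1}^{N} \mathbf{W}_i^T \mathbf{z}_i, \\
\nabla \phi(\mathbf{x}) &= \mathbf{H}_k \mathbf{x} + \mathbf{c} = \mathbf{0}
\;\Longrightarrow\;
\mathbf{x}(\mathbf{z}) = -\mathbf{H}_k^{-1} \mathbf{c}
= \mathbf{x}_k - \mathbf{H}_k^{-1} \nabla g(\mathbf{x}_k) + \mathbf{H}_k^{-1} \sum_{i=1}^{N} \mathbf{W}_i^T \mathbf{z}_i .
\end{aligned}
```

The last form shows that the dual variables act as a correction to the unregularized quasi-Newton step $\mathbf{x}_k - \mathbf{H}_k^{-1}\nabla g(\mathbf{x}_k)$.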

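From (8) and (9), the minimization problem over $\mathbf{z}$ is separable, so it can be implemented in parallel. The solution is given by

$$\mathbf{z}_i^{(j+1)} = \operatorname*{argmin}_{\mathbf{z}_i : (\mathbf{z}_i, 1) \in \mathcal{K}_i^*} \frac{1}{2 t^{(j)}} \big\| \mathbf{z}_i - \mathbf{z}_i^{(j)} \big\|_2^2 + \mathbf{z}_i^T \big( \mathbf{W}_i \mathbf{x}(\mathbf{z}^{(j)}) + \mathbf{b}_i \big). \tag{10}$$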


From (9) and (10), we obtain the specific AT method (Auslender and Teboulle 2006) to solve the problem (3) in Algorithm 2.

There are many variants of optimal first-order methods (Lan, Lu, and Monteiro 2011; Beck and Teboulle 2009; Nesterov 2007; Tseng 2008). Algorithm 2 is a generic algorithm but may not be the best choice for every model. By using the continuation techniques (Becker, Candès, and Grant 2012), we could obtain the exact solution very quickly.


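Let us take an example in which $\psi_i(\mathbf{W}_i \mathbf{x} + \mathbf{b}_i) = \|\mathbf{W}_i \mathbf{x} + \mathbf{b}_i\|_1 \le t_i$. Then $\mathcal{K}_i = \{(\mathbf{W}_i \mathbf{x} + \mathbf{b}_i, t_i) : \|\mathbf{W}_i \mathbf{x} + \mathbf{b}_i\|_1 \le t_i\}$ and $\mathcal{K}_i^* = \{(\mathbf{z}_i, \tau_i) : \|\mathbf{z}_i\|_\infty \le \tau_i\}$.

The Lagrangian and dual functions are given by

$$\mathcal{L}(\mathbf{w}; \Lambda) = g_k(\mathbf{x}) + \sum_{i=1}^{N} \big( t_i - \mathbf{z}_i^T (\mathbf{W}_i \mathbf{x} + \mathbf{b}_i) - \tau_i t_i \big),$$

$$D(\Lambda) = \inf_{\mathbf{w}} \mathcal{L}(\mathbf{x}, t_i; \mathbf{z}_i, \tau_i) = \inf_{\mathbf{w}} \Big\{ g_k(\mathbf{x}) + \sum_{i=1}^{N} \big( t_i - \mathbf{z}_i^T (\mathbf{W}_i \mathbf{x} + \mathbf{b}_i) - \tau_i t_i \big) \Big\}. \tag{7}$$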
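For the $\ell_1$ example, the dual cone constraint $(\mathbf{z}_i, 1) \in \mathcal{K}_i^*$ reduces to the box $\|\mathbf{z}_i\|_\infty \le 1$, so the dual update (10) becomes a gradient step followed by elementwise clipping. The sketch below shows that single step; the function name and the fixed step length `t` are assumptions of this example.

```python
import numpy as np

def dual_step_l1(z, t, W, x, b):
    """One dual update (10) for psi_i = ||.||_1.

    Minimizing (1/(2t)) ||z' - z||^2 + z'^T (W x + b) over the box
    ||z'||_inf <= 1 is the projection of z - t * (W x + b) onto that box.
    """
    return np.clip(z - t * (W @ x + b), -1.0, 1.0)

z_new = dual_step_l1(np.array([0.5, -0.5]), 1.0, np.eye(2), np.array([2.0, -0.25]), np.zeros(2))
```

Because the clipping acts coordinate by coordinate, the update for different blocks $\mathbf{z}_i$ can run in parallel, as the separability remark in the text suggests.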

i=l 


Acceleration 

We further employ several acceleration techniques in our implementation. By applying these techniques we achieve a much faster convergence rate, which is comparable to that of the conventional proximal Newton method. Our accelerated implementation behaves much better than the original proximal quasi-Newton method in various aspects.

Adaptive Initial Hessian  LBFGS sets the initial Hessian $\mathbf{H}_0$ as $\frac{\mathbf{y}_k^T \mathbf{y}_k}{\mathbf{s}_k^T \mathbf{y}_k} \mathbf{I}$. However, we find that this setting results in much slower convergence than the proximal Newton method. Thus, it is desirable to give a better initial Hessian $\mathbf{H}_0$, which in turn yields a better approximation of $\nabla^2 g(\mathbf{x}_k)$.






























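Algorithm 2 Solve Problem (3) via SCD

Require: $\mathbf{x}_k$, $S$, $Y$, $\mathbf{H}_0$, $\nabla g(\mathbf{x}_k)$, $\mathbf{z}^{(0)}$, $\mathbf{x}_0$
1: $\theta^{(0)} \leftarrow 1$, $\mathbf{v}^{(0)} \leftarrow \mathbf{z}^{(0)}$, $j \leftarrow 0$
2: repeat
3:   $\mathbf{y}^{(j)} \leftarrow (1 - \theta^{(j)}) \mathbf{v}^{(j)} + \theta^{(j)} \mathbf{z}^{(j)}$
4:   $\mathbf{x} \leftarrow -\mathbf{H}_k^{-1} \big( \nabla g(\mathbf{x}_k) - \mathbf{H}_k \mathbf{x}_k - \sum_{i=1}^{N} \mathbf{W}_i^T \mathbf{y}_i^{(j)} \big)$ by the LBFGS method.
5:   for $i \leftarrow 1, \ldots, N$ do
6:     $\mathbf{z}_i^{(j+1)} \leftarrow \operatorname*{argmin}_{\mathbf{z}_i : (\mathbf{z}_i, 1) \in \mathcal{K}_i^*} \frac{\theta^{(j)}}{2 t^{(j)}} \| \mathbf{z}_i - \mathbf{z}_i^{(j)} \|_2^2 + \mathbf{z}_i^T (\mathbf{W}_i \mathbf{x} + \mathbf{b}_i)$
7:     $\mathbf{v}_i^{(j+1)} \leftarrow (1 - \theta^{(j)}) \mathbf{v}_i^{(j)} + \theta^{(j)} \mathbf{z}_i^{(j+1)}$
8:   end for
9:   $\theta^{(j+1)} \leftarrow 2 / \big( 1 + (1 + 4 / (\theta^{(j)})^2)^{1/2} \big)$
10:  $j \leftarrow j + 1$
11: until some stopping condition is satisfied
12: $\Delta \leftarrow \mathbf{x} - \mathbf{x}_k$
13: return $\Delta$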
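The dual iteration built from (9) and (10) can be sketched end to end for a single $\ell_1$ term. To keep the example short and easy to check, it uses plain (unaccelerated) projected gradient steps with a fixed step length instead of the AT acceleration, and it forms $\mathbf{H}_k$ densely rather than through LBFGS; both are simplifications assumed here, and the function name is hypothetical.

```python
import numpy as np

def scd_l1_surrogate(x_k, grad, H, W, b, iters=200):
    """Solve min_x grad^T (x - x_k) + 0.5 (x - x_k)^T H (x - x_k) + ||W x + b||_1
    through its dual and return the direction x - x_k.

    Assumes H is symmetric positive definite and W is nonzero.
    """
    Hinv_Wt = np.linalg.solve(H, W.T)             # H^{-1} W^T
    x_free = x_k - np.linalg.solve(H, grad)       # x(z) at z = 0, see (9)
    lipschitz = np.linalg.norm(W @ Hinv_Wt, 2)    # Lipschitz constant of the dual gradient
    t = 1.0 / lipschitz
    z = np.zeros(W.shape[0])
    for _ in range(iters):
        x = x_free + Hinv_Wt @ z                  # primal minimizer (9)
        z = np.clip(z - t * (W @ x + b), -1.0, 1.0)  # dual update (10) for the l1 cone
    x = x_free + Hinv_Wt @ z
    return x - x_k

x_k = np.array([1.0, -1.0, 0.2])
delta = scd_l1_surrogate(x_k, np.array([0.5, 0.5, 0.0]), 2.0 * np.eye(3), np.eye(3), np.zeros(3))
```

In this example $\mathbf{H} = 2\mathbf{I}$ and $\mathbf{W} = \mathbf{I}$, so the result can be compared with the soft-thresholding solution of the same surrogate; for general $\mathbf{H}$ and several terms only the dual route remains practical.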


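Theorem 1. If $(1 - \alpha) \mathbf{H}_k \succeq \nabla^2 g(\mathbf{x}_k)$ for $\alpha \in (0, \frac{1}{2})$, $\mathbf{H}_k \succeq m\mathbf{I}$ ($m > 0$), and $\nabla^2 g$ is Lipschitz continuous with constant $L_2$, then the unit step length satisfies the sufficient decrease condition after sufficiently many iterations.

Theorem 2. Assume $\mathbf{H}_k$ and $\hat{\mathbf{H}}_k$ are generated by the same procedure $\{(\mathbf{s}_k, \mathbf{y}_k) : \mathbf{s}_k^T \mathbf{y}_k > 0\}$ but with different initial Hessians $\mathbf{H}_0$ and $\hat{\mathbf{H}}_0$, respectively. If $\mathbf{H}_0 \succeq \hat{\mathbf{H}}_0 \succ 0$, then $\mathbf{H}_k \succeq \hat{\mathbf{H}}_k \succ 0$.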

Based on Theorems 1 and 2, we can decrease $\mathbf{H}_0$ more aggressively. Once the unit step fails, we know that $(1 - \alpha) \mathbf{H}_k \succeq \nabla^2 g(\mathbf{x}_k)$ is broken; hence we need to increase $\mathbf{H}_0$. We propose our adaptive initial Hessian strategy in Algorithm 3. In practice, we set $\alpha$ to a small number like 0.0001.
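The monotonicity claimed in Theorem 2 can be checked numerically on a small example. The sketch below applies the same sequence of curvature pairs to two scaled-identity initial Hessians using the full-memory BFGS Hessian update and inspects the smallest eigenvalue of the difference. The random test problem, the dimensions, and the use of full-memory BFGS instead of LBFGS are assumptions of this illustration, not part of the paper's experiments.

```python
import numpy as np

def bfgs_update(B, s, y):
    """Standard BFGS update of a Hessian approximation B with a pair (s, y), y^T s > 0."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

def run_updates(h0, pairs, p):
    """Apply all pairs in order, starting from the initial Hessian h0 * I."""
    B = h0 * np.eye(p)
    for s, y in pairs:
        B = bfgs_update(B, s, y)
    return B

rng = np.random.default_rng(0)
p = 5
M = rng.standard_normal((p, p))
A = M @ M.T + np.eye(p)              # SPD matrix, so y = A s gives s^T y > 0
pairs = []
for _ in range(4):
    s = rng.standard_normal(p)
    pairs.append((s, A @ s))

H_big = run_updates(2.0, pairs, p)   # larger initial Hessian
H_small = run_updates(1.0, pairs, p) # smaller initial Hessian
min_eig = np.linalg.eigvalsh(H_big - H_small).min()
```

A value of `min_eig` that is non-negative up to rounding error is consistent with $\mathbf{H}_k \succeq \hat{\mathbf{H}}_k$, which is what justifies steering $\mathbf{H}_k$ by rescaling $\mathbf{H}_0$.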


Algorithm 3 Adaptive Initial Hessian
1: procedure ADA_HESS($t_k$, $\beta$, $\mathbf{s}_k$, $\mathbf{y}_k$, $\mathbf{H}_0$)
2: if $t_k < 1$ then

3: $\mathbf{H}_0 = \mathbf{H}_0 / t_k$

~ 1+1113 

5: end if 

T 

6: Ho = elementwise_min(^|B-, I) 

7: return Ho 

8 : end procedure 


Warm start and continuation SCD  We use the optimal dual value $\mathbf{z}_k$ obtained in solving the dual of $f_k(\mathbf{x})$ as the initial dual value for solving the dual of $f_{k+1}(\mathbf{x})$. This gives a warm start for the problem (3), and the iteration complexity is dramatically reduced.

By employing continuation SCD to solve the problem (3), the dual problem reaches an $\epsilon$-optimal solution within a number of iterations that scales with $\|\mathbf{z}_k - \mathbf{z}_{k+1}\|_2$, as shown in (Nesterov 2005).


3 Theoretical Analysis 

In this section we analyze the convergence rate of the SEP-QN method. Because of space limitations, we give the detailed proofs in the supplementary material. In order to establish global convergence and solve the problem efficiently, we make the following assumptions:

Assumption 3. $f$ is a closed convex function and $\inf_{\mathbf{x}} \{f(\mathbf{x}) \mid \mathbf{x} \in \operatorname{dom} f\}$ is attained at some $\mathbf{x}^*$.

Assumption 4. The smooth part $g$ is a closed, proper, convex, continuously differentiable function, and its gradient $\nabla g$ is Lipschitz continuous with constant $L_1$.

Assumption 5. The non-smooth part $\Psi$ should be closed, proper, and convex. The projection onto the dual cone associated with each $\psi_i$ is tractable, or equivalently, problem (10) is easy to solve.

First, we analyze the global convergence behavior of SEP- 
QN under these assumptions. 

Theorem 6. If the problem (3) is solved by continuation SCD, then $\{\mathbf{x}_k\}$ generated by the SEP-QN method converges to an optimal solution $\mathbf{x}^*$ starting at any $\mathbf{x}_0 \in \operatorname{dom} f$.

Under stronger assumptions, we can derive the local superlinear convergence rate as shown in the following theorem.

Theorem 7. Suppose $g$ is twice-continuously differentiable and strongly convex with constant $l$, and $\nabla^2 g$ is Lipschitz continuous with constant $L_2$. If $\mathbf{x}_0$ is sufficiently close to $\mathbf{x}^*$, the sequence $\{\mathbf{H}_k\}$ satisfies the Dennis-Moré criterion, and $l\mathbf{I} \preceq \mathbf{H}_k \preceq L\mathbf{I}$ for some $0 < l \le L$, then SEP-QN with the continuation SCD converges superlinearly after sufficiently many iterations.

Remark 8. Suppose SEP-QN converges within $T$ iterations. If the dataset is dense, then the complexity of SEP-QN is $T\,O(np) + T\,O\big(\frac{1}{\epsilon_s}(Mp + \sum_{i=1}^{N} q_i p)\big)$; if the dataset is sparse, the complexity is $T\,O(nnz) + T\,O\big(\frac{1}{\epsilon_s}(Mp + \sum_{i=1}^{N} q_i p)\big)$, where $M$ is the history size of LBFGS, $\epsilon_s$ is the tolerance of the problem (3), and $nnz$ is the number of non-zero entries in the sparse dataset.

We require that $n$ or $nnz$ be relatively large; otherwise the cost of solving the problem (3) exceeds the cost of evaluating the loss function. In that case, it would be better to use a first-order method instead of SEP-QN. If we ignore the impact of $T$ and the dataset is dense, the running time of SEP-QN is linear in the number of features, the number of samples, and the number of non-smooth terms. We will empirically validate the scalability and extensibility of SEP-QN in the following section.

4 Empirical Analysis 

We implement all the experiments on a single machine running the 64-bit version of Linux with an Intel Core i5-3470 CPU and 8 GB RAM. We test the SEP-QN framework on various real-world datasets such as gisette ($n = 6{,}000$ and $p = 5{,}000$) and epsilon ($n = 300{,}000$ and $p = 2{,}000$), which can be downloaded from the LIBSVM website¹. The dataset characteristics are provided in Table 1.

Figure 1: Convergence comparison. Panels: (a) epochs of $\ell_1$-logistic regression; (b) runtime of $\ell_1$-logistic regression; (c) fused $\ell_1$-logistic regression.


Table 1: Details of the datasets in our experiments

Dataset    p        n (train)    n (test)    nnz (train)
epsilon    2,000    300,000      100,000     600,000,000
gisette    5,000    6,000        1,000       29,729,997
usps       649      1,000        1,000       649,000


“Superposition-structured” Logistic Regression 

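We consider the “superposition-structured” logistic regression problem:

$$\min_{\mathbf{x} \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{a}_i^T \mathbf{x})) + \lambda \|\mathbf{x}\|_1 + \gamma \|\mathbf{W}\mathbf{x}\|_q.$$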


We first set $\mathbf{W} = \mathbf{0}$ for comparison with the result of PQN in (Lee, Sun, and Saunders 2012). For fairness of comparison, we use the same dataset gisette and the same setting of the tuning parameter $\lambda$ as (Lee, Sun, and Saunders 2012; Yuan, Ho, and Lin 2012). The results are shown in Figures 1(a) and 1(b). We can see that the SEP-QN method has the fastest convergence rate, which agrees with Theorems 6 and 7.

In order to verify the effectiveness and efficiency when the model has multiple structural constraints, we compare SEP-QN with ADMM and the direct SCD in TFOCS (Becker, Candès, and Grant 2012) on the fused sparse logistic regression by setting $\|\mathbf{W}\mathbf{x}\|_q = \|\mathbf{x}\|_{TV}$ and $\gamma = \lambda$. Figure 1(c) shows that the three algorithms converge to the same optimal value, but SEP-QN performs much better.


Multi-task Learning

Next we solve the multi-task learning problem where the parameter matrix $\mathbf{X}$ will have a sparse + group sparse structure. In our framework, there is no need to separate the parameter matrix $\mathbf{X}$ into $\mathbf{S} + \mathbf{B}$ as in (Jalali et al. 2010; Hsieh et al. 2014). Instead of using the square loss (as in (Jalali et al. 2010)), we consider the logistic loss, which gives better performance. Thus, $\mathbf{X}$ could be estimated by the following objective function,

$$\min_{\mathbf{X}} \sum_{k=1}^{r} \ell_{\text{logistic}}\big(\mathbf{y}^{(k)}, \mathbf{X}^{(k)} \mathbf{X}_{:,k}\big) + \lambda \|\mathbf{X}\|_1 + \gamma \|\mathbf{X}\|_{1,2},$$

where, as in (2), $\mathbf{X}^{(k)}$ inside the loss is the sample matrix of the $k$-th task and $\mathbf{X}_{:,k}$ denotes the $k$-th column of the parameter matrix $\mathbf{X}$.

¹ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets

Figure 2: Feature visualization (panel title: sparse + group sparse features). As shown in the colorbar, orange color indicates that the value of the corresponding feature is 0.

We follow (Jalali et al. 2010; Hsieh et al. 2014) and transform multi-class problems into multi-task problems. For fairness of comparison, we test on the same dataset USPS, which was first collected in (Van Breukelen et al. 1998) and subsequently widely used in multi-task papers as a reliable dataset for handwritten recognition algorithms. There are $r = 10$ tasks, and each handwritten sample consists of $p = 649$ features. In (Jalali et al. 2010; Hsieh et al. 2014), the









































































Table 2: The comparisons on multi-task problems. For the sparse + group sparse model (SEP-QN, QUIC & DIRTY, ADMM), entries are test error rate / training time; for the other models (Lasso, Group Lasso), entries are test error rate.

n      relative error    SEP-QN          QUIC & DIRTY    ADMM            Lasso    Group Lasso
100    10^-…             7.3% / 0.32s    8.3% / 0.42s    8.3% / 1.5s     7.9%     7.4%
       10^-…             6.4% / 0.93s    7.4% / 0.75s    7.5% / 4.3s
400    10^-…             3.0% / 1.2s     2.9% / 1.01s    3.0% / 3.6s     3.0%     3.1%
       10^-…             2.6% / 2.0s     2.5% / 1.55s    2.6% / 11.0s





Figure 3: Scalability and extensibility. Panels: (a) feature-number scalability; (b) data-size scalability; (c) nonsmooth-terms extensibility.


authors demonstrated that on USPS, using sparse and group sparse regularizations together outperforms the models with a single regularizer.

We visualize the features estimated by our SEP-QN framework in Figure 2; we plot only the first sixty features to provide a clear visualization. Figure 2 shows that the feature structure is well maintained by the regularizer. The promising results of the “sparse + group sparse structure” further validate the effectiveness of our SEP-QN framework. As shown in Table 2, our SEP-QN framework is comparable to QUIC & DIRTY, which is the state-of-the-art method. Unlike QUIC & DIRTY, our implementation is straightforward in the SEP-QN framework. Because our framework targets a broad class of models, it may be slower than QUIC & DIRTY on some specific datasets. However, we will show that our framework is scalable by the experiments in the following section.
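The sparse + group sparse regularizer on the parameter matrix can be sketched as follows. The text does not spell out the convention for $\|\mathbf{X}\|_{1,2}$; this example assumes it is the sum of the Euclidean norms of the rows of $\mathbf{X}$ (one row per feature, shared across the $r$ tasks), and the function names are illustrative.

```python
import numpy as np

def l1_norm(X):
    """Elementwise l1 norm, promoting individual zero entries."""
    return np.abs(X).sum()

def l12_norm(X):
    """Sum of row-wise Euclidean norms (assumed convention for ||X||_{1,2}),
    promoting rows that are zero across all tasks."""
    return np.linalg.norm(X, axis=1).sum()

def sparse_group_sparse_penalty(X, lam, gamma):
    """lam * ||X||_1 + gamma * ||X||_{1,2} applied to a single parameter matrix."""
    return lam * l1_norm(X) + gamma * l12_norm(X)

X = np.array([[3.0, 4.0], [0.0, 0.0], [0.0, -2.0]])
value = sparse_group_sparse_penalty(X, 1.0, 2.0)
```

Both terms act on the same matrix, which is why no explicit split into a sparse part and a group sparse part is needed in this formulation.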


Scalability and Extensibility

We consider the group generalized lasso problem (Tibshirani et al. 2005; Simon et al. 2013; Meier, Van De Geer, and Bühlmann 2008), but use the logistic loss function instead for classification. Specifically,

$$\min_{\mathbf{x}} \frac{1}{n} \sum_{i=1}^{n} \log(1 + \exp(-y_i \mathbf{a}_i^T \mathbf{x})) + \lambda_1 \|\mathbf{x}\|_1 + \lambda_2 \|\mathbf{F}\mathbf{x}\|_1 + \sum_{j=1}^{N-2} \gamma_j \|\mathbf{G}_j \mathbf{x}\|_2.$$

As far as we know, there is no efficient algorithm to solve this model. Note that this dirty model may not be a good choice for the gisette and epsilon datasets. We just use this model to validate the scalability and extensibility of the SEP-QN framework.

We use the fused sparse logistic regression ($\lambda_1 = …$, $\lambda_2 = …$, $N = 2$) and the gisette dataset to test the feature-number scalability of SEP-QN, as shown in Figure 3(a). Then we test the data-size scalability on the epsilon dataset, as shown in Figure 3(b). We can see that the convergence time is linear with respect to the number of features as well as the amount of data.

Then we use the group sparse logistic model ($\lambda_1 = …$, $\lambda_2 = 0$, $\gamma_j = …$) and the epsilon dataset to test the nonsmooth-terms extensibility of SEP-QN. As shown in Figure 3(c), the convergence time is linear with respect to the number of non-smooth terms. These experiments further verify Remark 8 under the assumptions.


5 Conclusion

In this paper, we have generalized the proximal quasi-Newton method to handle “superposition-structured” statistical models and devised the SEP-QN framework. With the help of the SCD approach and the LBFGS updating formula, we can solve the surrogate problem in an efficient and feasible way. We have explored the global convergence and the super-linear convergence both theoretically and empirically. Compared with prior methods, SEP-QN converges significantly faster and scales much better, and the promising experimental results on several real-world datasets have further validated the scalability and extensibility of the SEP-QN framework.


Appendix A
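Proof of Theorem 1

Lemma A.1. If $\mathbf{H}_k$ is positive definite, then $\Delta\mathbf{x}_k$ satisfies

$$\nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) - \Psi(\mathbf{x}_k) \le -\Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k.$$

The proof of this lemma is shown in (Lee, Sun, and Saunders 2014).

Theorem 1. If $(1 - \alpha) \mathbf{H}_k \succeq \nabla^2 g(\mathbf{x}_k)$ for $\alpha \in (0, \frac{1}{2})$, $\mathbf{H}_k \succeq m\mathbf{I}$ ($m > 0$), and $\nabla^2 g$ is Lipschitz continuous with constant $L_2$, then the unit step length satisfies the sufficient decrease condition after sufficiently many iterations.

Proof. By Lemma A.1 we have

$$\nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) - \Psi(\mathbf{x}_k) + \frac{1}{2} \Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k \le -\frac{1}{2} \Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k < 0.$$

Since $\gamma_k = \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) - \Psi(\mathbf{x}_k)$ (see the sufficient descent condition (5)), we have

$$\nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) - \Psi(\mathbf{x}_k) + \frac{1 - \alpha}{2} \Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k \le \alpha \gamma_k,$$

$$g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) + \frac{1 - \alpha}{2} \Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k \le g(\mathbf{x}_k) + \Psi(\mathbf{x}_k) + \alpha \gamma_k. \tag{A.1}$$

Since $\nabla^2 g$ is Lipschitz continuous with constant $L_2$, the smooth part $g(\mathbf{x})$ can be expanded in a Taylor series as follows:

$$\begin{aligned} f(\mathbf{x}_k + \Delta\mathbf{x}_k) &= g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \frac{1}{2} \Delta\mathbf{x}_k^T \nabla^2 g(\mathbf{x}_k) \Delta\mathbf{x}_k + o(\|\Delta\mathbf{x}_k\|_2^2) + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) \\ &\le g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \frac{1}{2} \Delta\mathbf{x}_k^T \nabla^2 g(\mathbf{x}_k) \Delta\mathbf{x}_k + \frac{L_2}{6} \|\Delta\mathbf{x}_k\|_2^3 + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k). \end{aligned} \tag{A.2}$$

Since $\mathbf{H}_k \succeq m\mathbf{I}$, from Lemma A.1 we have

$$\|\Delta\mathbf{x}_k\|_2^2 \le -\frac{\gamma_k}{m}. \tag{A.3}$$

Because $(1 - \alpha) \mathbf{H}_k \succeq \nabla^2 g(\mathbf{x}_k)$, we use the results in (A.1), (A.2) and (A.3) to yield

$$\begin{aligned} f(\mathbf{x}_k + \Delta\mathbf{x}_k) &\le g(\mathbf{x}_k) + \nabla g(\mathbf{x}_k)^T \Delta\mathbf{x}_k + \Psi(\mathbf{x}_k + \Delta\mathbf{x}_k) + \frac{1 - \alpha}{2} \Delta\mathbf{x}_k^T \mathbf{H}_k \Delta\mathbf{x}_k + \frac{L_2}{6} \|\Delta\mathbf{x}_k\|_2^3 \\ &\le f(\mathbf{x}_k) + \alpha \gamma_k + \frac{L_2}{6} \|\Delta\mathbf{x}_k\|_2^3 \\ &\le f(\mathbf{x}_k) + \alpha \gamma_k - \frac{L_2}{6m} \|\Delta\mathbf{x}_k\|_2 \gamma_k. \end{aligned}$$

We can show that $\|\Delta\mathbf{x}_k\|_2$ converges to zero via Theorem 6. Hence, for $k$ sufficiently large, the unit step length satisfies the sufficient descent condition (5). □


Proof of Theorem 2

Theorem 2. Assume $\mathbf{H}_k$ and $\hat{\mathbf{H}}_k$ are generated by the same procedure $\{(\mathbf{s}_k, \mathbf{y}_k) : \mathbf{s}_k^T \mathbf{y}_k > 0\}$ but with different initial Hessians $\mathbf{H}_0$ and $\hat{\mathbf{H}}_0$, respectively. If $\mathbf{H}_0 \succeq \hat{\mathbf{H}}_0 \succ 0$, then $\mathbf{H}_k \succeq \hat{\mathbf{H}}_k \succ 0$.

Proof. By the assumptions $\mathbf{s}_k^T \mathbf{y}_k > 0$ and $\hat{\mathbf{H}}_0 \succ 0$, we can prove this result using the BFGS updating formula, written for the inverse of the Hessian approximation,

$$\mathbf{H}_{k+1}^{-1} = \Big( \mathbf{I} - \frac{\mathbf{s}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \Big) \mathbf{H}_k^{-1} \Big( \mathbf{I} - \frac{\mathbf{y}_k \mathbf{s}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \Big) + \frac{\mathbf{s}_k \mathbf{s}_k^T}{\mathbf{y}_k^T \mathbf{s}_k}.$$

By induction, if $\mathbf{H}_k \succeq \hat{\mathbf{H}}_k \succ 0$, that is, $\hat{\mathbf{H}}_k^{-1} \succeq \mathbf{H}_k^{-1} \succ 0$, then

$$\Big( \mathbf{I} - \frac{\mathbf{s}_k \mathbf{y}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \Big) \big( \hat{\mathbf{H}}_k^{-1} - \mathbf{H}_k^{-1} \big) \Big( \mathbf{I} - \frac{\mathbf{y}_k \mathbf{s}_k^T}{\mathbf{y}_k^T \mathbf{s}_k} \Big)$$

is positive semidefinite when $\mathbf{s}_k^T \mathbf{y}_k > 0$, so $\hat{\mathbf{H}}_{k+1}^{-1} \succeq \mathbf{H}_{k+1}^{-1}$ and hence $\mathbf{H}_{k+1} \succeq \hat{\mathbf{H}}_{k+1} \succ 0$. Therefore, with a larger initial $\mathbf{H}_0$, we obtain a larger $\mathbf{H}_k$. □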


Proof of Theorem 6

Lemma A.2. Suppose $f$ is a closed convex function and $\inf_{\mathbf{x}} \{f(\mathbf{x}) \mid \mathbf{x} \in \operatorname{dom} f\}$ is attained at some $\mathbf{x}^*$. If $\mathbf{H}_k \succeq l\mathbf{I}$ for some $l > 0$ and the surrogate problem is solved exactly in the proximal quasi-Newton method, then $\mathbf{x}_k$ converges to an optimal solution starting at any $\mathbf{x}_0 \in \operatorname{dom} f$.

The proof of this lemma is shown in (Lee, Sun, and Saunders 2014).

Theorem 6. If the problem (3) is solved by continuation SCD, then $\{\mathbf{x}_k\}$ generated by the SEP-QN method converges to an optimal solution $\mathbf{x}^*$ starting at any $\mathbf{x}_0 \in \operatorname{dom} f$.

Proof. From Algorithm 1, SEP-QN uses SCD to solve the local proximal surrogate of the composite function, and the adaptive Hessian strategy keeps $\mathbf{H}_k \succeq l\mathbf{I}$ for some $l > 0$. By continuation SCD, the surrogate problem (3) is solved exactly. Based on Lemma A.2 and Assumption 3, $\mathbf{x}_k$ converges to an optimal solution $\mathbf{x}^*$ starting at any $\mathbf{x}_0 \in \operatorname{dom} f$. □

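Proof of Theorem 7

Theorem 7. Suppose $g$ is twice-continuously differentiable and strongly convex with constant $l$, and $\nabla^2 g$ is Lipschitz continuous with constant $L_2$. If $\mathbf{x}_0$ is sufficiently close to $\mathbf{x}^*$, the sequence $\{\mathbf{H}_k\}$ satisfies the Dennis-Moré criterion, and $l\mathbf{I} \preceq \mathbf{H}_k \preceq L\mathbf{I}$ for some $0 < l \le L$, then SEP-QN with the continuation SCD converges superlinearly after sufficiently many iterations.

Proof. After sufficiently many iterations, $\beta \approx 1$. Under the Dennis-Moré criterion, we can show that the unit step length satisfies the sufficient descent condition (5) via the argument used in the proof of Theorem 1. Then, writing $\Delta\mathbf{x}_k^{nt}$ for the proximal Newton direction and $\mathbf{x}_{k+1}^{nt} = \mathbf{x}_k + \Delta\mathbf{x}_k^{nt}$, we have

$$\begin{aligned} \|\mathbf{x}_{k+1} - \mathbf{x}^*\|_2 &= \|\mathbf{x}_k + \Delta\mathbf{x}_k - \mathbf{x}^*\|_2 \\ &= \|\mathbf{x}_k + \Delta\mathbf{x}_k + \Delta\mathbf{x}_k^{nt} - \Delta\mathbf{x}_k^{nt} - \mathbf{x}^*\|_2 \\ &\le \|\mathbf{x}_k + \Delta\mathbf{x}_k^{nt} - \mathbf{x}^*\|_2 + \|\Delta\mathbf{x}_k - \Delta\mathbf{x}_k^{nt}\|_2. \end{aligned} \tag{A.4}$$

According to Theorem 3.4 in (Lee, Sun, and Saunders 2014), the proximal Newton method converges quadratically, that is,

$$\|\mathbf{x}_{k+1}^{nt} - \mathbf{x}^*\|_2 \le \frac{L_2}{2l} \|\mathbf{x}_k - \mathbf{x}^*\|_2^2. \tag{A.5}$$

By continuation SCD, the surrogate problem (3) is solved exactly. We draw the same conclusion as (Lee, Sun, and Saunders 2014), that is,

$$\begin{aligned} \|\Delta\mathbf{x}_k - \Delta\mathbf{x}_k^{nt}\|_2 &\le c_1 \|\mathbf{x}_k - \mathbf{x}^*\|_2^{1/2} \|\Delta\mathbf{x}_k\|_2 + o(\|\Delta\mathbf{x}_k\|_2), \\ \|\Delta\mathbf{x}_k\|_2 &\le c_2 \|\Delta\mathbf{x}_k^{nt}\|_2 = c_2 \|\mathbf{x}_{k+1}^{nt} - \mathbf{x}_k\|_2 \le O(\|\mathbf{x}_k - \mathbf{x}^*\|_2^2) + c_2 \|\mathbf{x}_k - \mathbf{x}^*\|_2. \end{aligned} \tag{A.6}$$

From (A.4), (A.5) and (A.6), we conclude that

$$\|\mathbf{x}_{k+1} - \mathbf{x}^*\|_2 \le \frac{L_2}{2l} \|\mathbf{x}_k - \mathbf{x}^*\|_2^2 + o(\|\mathbf{x}_k - \mathbf{x}^*\|_2).$$

Because the proximal Newton iterate converges quadratically, we deduce that $\mathbf{x}_k$ converges to $\mathbf{x}^*$ superlinearly. □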








